tion apparatus extracts features from an input speech frame
and then provides the extracted features to a central
processing station. In the central processing station, the features
are provided to a word decoder which determines the syntax
of the input speech frame.
DISTRIBUTED VOICE RECOGNITION SYSTEM

This application is a continuation of application Ser. No.
08/534,080, filed Sep. 21, 1995 which is a continuation of
Ser. No. 08/173,247, filed Dec. 22, 1993.

FIELD OF THE INVENTION

The present invention relates to speech signal processing.
More particularly, the present invention relates to a novel
method and apparatus for realizing a distributed implementation
of a standard voice recognition system.

DESCRIPTION OF THE RELATED ART

Voice recognition represents one of the most important
techniques to endow a machine with simulated intelligence
to recognize user voiced commands and to facilitate human
interface with the machine. It also represents a key technique
for human speech understanding. Systems that employ techniques
to recover a linguistic message from an acoustic
speech signal are called voice recognizers (VR). A voice
recognizer is composed of an acoustic processor, which
extracts a sequence of information-bearing features
(vectors) necessary for VR from the incoming raw speech,
and a word decoder, which decodes this sequence of features
(vectors) to yield the meaningful and desired format of
output, such as a sequence of linguistic words corresponding
to the input utterance. To increase the performance of a
given system, training is required to equip the system with
valid parameters. In other words, the system needs to learn
before it can function optimally.

The acoustic processor represents a front end speech
analysis subsystem in a voice recognizer. In response to an
input speech signal, it provides an appropriate representation
to characterize the time-varying speech signal. It should
discard irrelevant information such as background noise,
channel distortion, speaker characteristics and manner of
speaking. Efficient acoustic feature extraction systems will
furnish voice recognizers with higher acoustic discrimination
power. The most useful characteristic is the short time
spectral envelope. In characterizing the short time spectral
envelope, the two most commonly used spectral analysis
techniques are linear predictive coding (LPC) and filter-bank
based spectral analysis models. However, it is readily shown
(as discussed in Rabiner, L. R. and Schafer, R. W., Digital
Processing of Speech Signals, Prentice Hall, 1978) that LPC
not only provides a good approximation to the vocal tract
spectral envelope, but is considerably less expensive in
computation than the filter-bank model in all digital
implementations. Experience has also demonstrated that the
performance of LPC based voice recognizers is comparable to
or better than that of filter-bank based recognizers (Rabiner,
L. R. and Juang, B. H., Fundamentals of Speech
Recognition, Prentice Hall, 1993).

Referring to FIG. 1, in an LPC based acoustic processor,
the input speech is provided to a microphone (not shown)
and converted to an analog electrical signal. This electrical
signal is then digitized by an A/D converter (not shown). The
digitized speech signals are passed through preemphasis
filter 2 in order to spectrally flatten the signal and to make
it less susceptible to finite precision effects in subsequent
signal processing. The preemphasis filtered speech is then
provided to segmentation element 4 where it is segmented or
blocked into either temporally overlapped or nonoverlapped
blocks. The frames of speech data are then provided to
windowing element 6 where framed DC components are
removed and a digital windowing operation is performed on
each frame to lessen the blocking effects due to the
discontinuity at frame boundaries. A most commonly used window
function in LPC analysis is the Hamming window, w(n),
defined as:

$$ w(n) = 0.54 - 0.46 \cos\left(\frac{2\pi n}{N-1}\right), \quad 0 \le n \le N-1 $$

The windowed speech is provided to LPC analysis element
8. In LPC analysis element 8 autocorrelation functions are
calculated based on the windowed samples and corresponding
LPC parameters are obtained directly from the autocorrelation
functions.

Generally speaking, the word decoder translates the
acoustic feature sequence produced by the acoustic processor
into an estimate of the speaker's original word string.
This is accomplished in two steps: acoustic pattern matching
and language modeling. Language modeling can be avoided
in the applications of isolated word recognition. The LPC
parameters from LPC analysis element 8 are provided to
acoustic pattern matching element 10 to detect and classify
possible acoustic patterns, such as phonemes, syllables,
words, etc. The candidate patterns are provided to language
modeling element 12, which models the rules of syntactic
constraints that determine what sequences of words are
grammatically well formed and meaningful. Syntactic
information can be a valuable guide to voice recognition when
acoustic information alone is ambiguous. Based on language
modeling, the VR sequentially interprets the acoustic feature
matching results and provides the estimated word string.

Both the acoustic pattern matching and language modeling
in the word decoder require a mathematical model,
either deterministic or stochastic, to describe the speaker's
phonological and acoustic-phonetic variations. The performance
of a speech recognition system is directly related to
the quality of these two modelings. Among the various
classes of models for acoustic pattern matching, template-based
dynamic time warping (DTW) and stochastic hidden
Markov modeling (HMM) are the two most commonly used.
However, it has been shown that a DTW based approach can
be viewed as a special case of an HMM based one, which is
a parametric, doubly stochastic model. HMM systems are
currently the most successful speech recognition algorithms.
The doubly stochastic property in HMM provides better
flexibility in absorbing acoustic as well as temporal variations
associated with speech signals. This usually results in
improved recognition accuracy. Concerning the language
model, a stochastic model, called the k-gram language model,
which is detailed in F. Jelinek, "The Development of an
Experimental Discrete Dictation Recognizer", Proc. IEEE,
vol. 73, pp. 1616-1624, 1985, has been successfully applied
in practical large vocabulary voice recognition systems.
In the small vocabulary case, by contrast, a deterministic grammar
has been formulated as a finite state network (FSN) in the
application of airline and reservation and information system
(see Rabiner, L. R. and Levinson, S. E., A Speaker-Independent,
Syntax-Directed, Connected Word Recognition
System Based on Hidden Markov Model and Level
Building, IEEE Trans. on ASSP, Vol. 33, No. 3, June 1985.)

Statistically, in order to minimize the probability of
recognition error, the voice recognition problem can be
formalized as follows: with acoustic evidence observation O, the
operations of voice recognition are to find the most likely
word string W* such that

$$ W^{*} = \arg\max_{W} P(W \mid O) \qquad (1) $$
where the maximization is over all possible word strings W.
In accordance with Bayes rule, the a posteriori probability
P(W|O) in the above equation can be rewritten as:

$$ P(W \mid O) = \frac{P(W)\,P(O \mid W)}{P(O)} \qquad (2) $$

Since P(O) is irrelevant to recognition, the word string
estimate can be obtained alternatively as:

$$ W^{*} = \arg\max_{W} P(W)\,P(O \mid W) \qquad (3) $$

Here P(W) represents the a priori probability that the word
string W will be uttered, and P(O|W) is the probability that
the acoustic evidence O will be observed given that the
speaker uttered the word sequence W. P(O|W) is determined
by acoustic pattern matching, while the a priori probability
P(W) is defined by the language model utilized.

In connected word recognition, if the vocabulary is small
(less than 100), a deterministic grammar can be used to
rigidly govern which words can logically follow other words
to form legal sentences in the language. The deterministic
grammar can be incorporated in the acoustic matching
algorithm implicitly to constrain the search space of potential
words and to reduce the computation dramatically.
However, when the vocabulary size is either medium
(greater than 100 but less than 1000) or large (greater than
1000), the probability of the word sequence,
$W = (w_1, w_2, \ldots, w_n)$, can be obtained by stochastic
language modeling. From simple probability theory, the prior
probability P(W) can be decomposed as

$$ P(W) = P(w_1, w_2, \ldots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_1, w_2, \ldots, w_{i-1}) \qquad (4) $$

where $P(w_i \mid w_1, w_2, \ldots, w_{i-1})$ is the probability that $w_i$ will
be spoken given that the word sequence $(w_1, w_2, \ldots, w_{i-1})$
precedes it. The choice of $w_i$ depends on the entire past
history of input words. For a vocabulary of size V, it requires
$V^i$ values to specify $P(w_i \mid w_1, w_2, \ldots, w_{i-1})$ completely. Even
for the mid vocabulary size, this requires a formidable
number of samples to train the language model. An inaccurate
estimate of $P(w_i \mid w_1, w_2, \ldots, w_{i-1})$ due to insufficient
training data will depreciate the results of the original acoustic
matching.

A practical solution to the above problems is to assume
that $w_i$ only depends on the (k-1) preceding words,
$w_{i-1}, w_{i-2}, \ldots, w_{i-k+1}$. A stochastic language model can be
completely described in terms of
$P(w_i \mid w_{i-1}, w_{i-2}, \ldots, w_{i-k+1})$, from which the
k-gram language model is derived. Since most of the word
strings will never occur in the language if k>3, unigram
(k=1), bigram (k=2) and trigram (k=3) are the most powerful
stochastic language models that take grammar into consideration
statistically. Language modeling contains both syntactic
and semantic information which is valuable in
recognition, but these probabilities must be trained from a
large collection of speech data. When the available training
data are relatively limited, so that some k-grams may never occur
in the data, $P(w_i \mid w_{i-2}, w_{i-1})$ can be estimated directly from the
bigram probability $P(w_i \mid w_{i-1})$. Details of this process can be
found in F. Jelinek, "The Development of An Experimental
Discrete Dictation Recognizer", Proc. IEEE, vol. 73, pp.
1616-1624, 1985. In connected word recognition, the whole
word model is used as the basic speech unit, while in
continuous voice recognition, subband units, such as
phonemes, syllables or demisyllables may be used as the
basic speech unit. The word decoder will be modified
accordingly.

Conventional voice recognition systems integrate the acoustic
processor and word decoders without taking into account
their separability, the limitations of application systems
(such as power consumption, memory availability, etc.) and
communication channel characteristics. This motivates the
interest in devising a distributed voice recognition system
with these two components appropriately separated.

SUMMARY OF THE INVENTION

The present invention is a novel and improved distributed
voice recognition system, in which (i) the front end acoustic
processor can be LPC based or filter bank based; (ii) the
acoustic pattern matching in the word decoder can be based on
hidden Markov model (HMM), dynamic time warping
(DTW) or even neural networks (NN); and (iii) for the
connected or continuous word recognition purpose, the
language model can be based on deterministic or stochastic
grammars. The present invention differs from the usual
voice recognizer in improving system performance by
appropriately separating the components: feature extraction
and word decoding. As demonstrated in the next examples, if
LPC based features, such as cepstrum coefficients, are to be
sent over a communication channel, a transformation between
LPC and LSP can be used to alleviate the noise effects on
the feature sequence.

BRIEF DESCRIPTION OF THE DRAWINGS

The features, objects, and advantages of the present
invention will become more apparent from the detailed
description set forth below when taken in conjunction with
the drawings in which like reference characters identify
correspondingly throughout and wherein:

FIG. 1 is a block diagram of a traditional speech recognition
system;

FIG. 2 is a block diagram of an exemplary implementation
of the present invention in a wireless communication
environment;

FIG. 3 is a general block diagram of the present invention;

FIG. 4 is a block diagram of an exemplary embodiment of
the transform element and inverse transform element of the
present invention; and

FIG. 5 is a block diagram of a preferred embodiment of
the present invention comprising a local word detector in
addition to a remote word detector.

DETAILED DESCRIPTION OF THE
PREFERRED EMBODIMENTS

In a standard voice recognizer, either in recognition or in
training, most of the computational complexity is concentrated
in the word decoder subsystem of the voice recognizer.
In the implementation of voice recognizers with
distributed system architecture, it is often desirable to place
the word decoding task at the subsystem which can absorb
the computational load appropriately, whereas the acoustic
processor should reside as close to the speech source as
possible to reduce the effects of quantization errors
introduced by signal processing and/or channel induced errors.

An exemplary implementation of the present invention is
illustrated in FIG. 2. In the exemplary embodiment, the
environment is a wireless communication system comprising
a portable cellular telephone or personal communications
device 40 and a central communications center referred
to as a cell base station 42. In the exemplary embodiment the
distributed VR system is presented. In the distributed VR the
acoustic processor or feature extraction element 22 resides
in personal communication device 40 and word decoder 48
resides in the central communications center. If, instead of
distributed VR, VR is implemented solely in the portable
cellular phone it would be highly infeasible even for medium
size vocabulary, connected word recognition due to high
computation cost. On the other hand, if VR resides merely
at the base station, the accuracy can be decreased dramatically
due to the degradation of speech signals associated
with speech codec and channel effects. Evidently, there are
three advantages to the proposed distributed system design.
The first is the reduction in cost of the cellular telephone due
to the word decoder hardware that is no longer resident in
the telephone 40. The second is a reduction of the drain on
the battery (not shown) of portable telephone 40 that would
result from locally performing the computationally intensive
word decoder operation. The third is the expected improvement
in recognition accuracy in addition to the flexibility
and extendibility of the distributed system.

The speech is provided to microphone 20 which converts
the speech signal into electrical signals which are provided
to feature extraction element 22. The signals from microphone
20 may be analog or digital. If the signals are analog,
then an analog to digital converter (not shown) may be
needed to be interposed between microphone 20 and feature
extraction element 22. The speech signals are provided to
feature extraction element 22. Feature extraction element 22
extracts relevant characteristics of the input speech that will
be used to decode the linguistic interpretation of the input
speech. One example of characteristics that can be used to
estimate speech is the frequency characteristics of an input
speech frame. This is frequently provided as linear predictive
coding parameters of the input frame of speech. The
extracted features of the speech are then provided to
transmitter 24 which codes, modulates and amplifies the
extracted feature signal and provides the features through
duplexer 26 to antenna 28, where the speech features are
transmitted to cellular base station or central communications
center 42. Various types of digital coding, modulation,
and transmission schemes well known in the art may be
employed.

At central communications center 42, the transmitted
features are received at antenna 44 and provided to receiver
46. Receiver 46 may perform the functions of demodulating
and decoding the received transmitted features which it
provides to word decoder 48. Word decoder 48 determines,
from the speech features, a linguistic estimate of the speech
and provides an action signal to transmitter 50. Transmitter
50 performs the functions of amplification, modulation and
coding of the action signal, and provides the amplified signal
to antenna 52, which transmits the estimated words or a
command signal to portable phone 40. Transmitter 50 may
also employ well known digital coding, modulation or
transmission techniques.

At portable phone 40, the estimated words or command
signals are received at antenna 28, which provides the
received signal through duplexer 26 to receiver 30 which
demodulates, decodes the signal and then provides the
command signal or estimated words to control element 38.
In response to the received command signal or estimated
words, control element 38 provides the intended response
(e.g., dialing a phone number, providing information to
display screen on the portable phone, etc.).

The same system represented in FIG. 2 could also serve
in a slightly different way in that the information sent back
from central communications center 42 need not be an
interpretation of the transmitted speech, rather the information
sent back from central communications center 42 may
be a response to the decoded message sent by the portable
phone. For example, one may inquire of messages on a
remote answering machine (not shown) coupled via a
communications network to central communications center 42,
in which case the signal transmitted from central
communications center 42 to portable telephone 40 may be the
messages from the answering machine in this implementation.
A second control element 49 would be collocated in the
central communications center.

The significance of placing the feature extraction element
22 in portable phone 40 instead of at central communications
center 42 is as follows. If the acoustic processor is
placed at central communications center 42, as opposed to
distributed VR, a low bandwidth digital radio channel may
require a vocoder (at the first subsystem) which limits
resolution of features vectors due to quantization distortion.
However, by putting the acoustic processor in the portable or
cellular phone, one can dedicate entire channel bandwidth to
feature transmission. Usually, the extracted acoustic feature
vector requires less bandwidth than the speech signal for
transmission. Since recognition accuracy is highly dependent
on degradation of input speech signal, one should
provide feature extraction element 22 as close to the user as
possible so that feature extraction element 22 extracts feature
vectors based on microphone speech, instead of
(vocoded) telephone speech which may be additionally
corrupted in transmission.

In real applications, voice recognizers are designed to
operate under ambient conditions, such as background noise.
Thus, it is important to consider the problem of voice
recognition in the presence of noise. It has been shown that,
if the training of vocabulary (reference patterns) is performed
in the exact (or approximate) same environment as
the test condition, voice recognizers not only can provide
good performance even in very noisy environments, but can
reduce the degradation in recognition accuracy due to noise
significantly. The mismatch between training and test
conditions accounts for one of the major degradation factors in
recognition performance. With the assumption that acoustic
features can traverse communication channels more reliably
than speech signals (since acoustic features require less
bandwidth than speech signals for transmission as mentioned
previously), the proposed distributed voice recognition
system is advantageous in providing matching conditions.
If a voice recognizer is implemented remotely, the
matching conditions can be badly broken due mainly to
channel variations such as fading encountered in wireless
communications. Implementing VR locally may avoid these
effects if the huge training computations can be absorbed
locally. Unfortunately, in many applications, this is not
possible. Obviously, distributed voice recognition
implementation can avoid mismatch conditions induced by
channel perplexity and compensate for the shortcomings of
centralized implementations.

Referring to FIG. 3, the digital speech samples are provided
to feature extraction element 51 which provides the
features over communication channel 56 to word estimation
element 62 where an estimated word string is determined.
The speech signals are provided to acoustic processor 52
which determines potential features for each speech frame.
Since word decoder requires acoustic feature sequence as
input for both recognition and training tasks, these acoustic
features need to be transmitted across the communication
channel 56. However, not all potential features used in
typical voice recognition systems are suitable for transmission
through noisy channels. In some cases, transform
element 54 is required to facilitate source encoding and to
reduce the effects of channel noise. One example of LPC
based acoustic features which are widely used in voice
recognizers is cepstrum coefficients, $\{c_i\}$. They can be
obtained directly from LPC coefficients $\{a_i\}$ as follows:

$$ c_m = a_m + \sum_{k=1}^{m-1} \frac{k}{m}\, c_k\, a_{m-k}, \quad m = 1, \ldots, P \qquad (5) $$

$$ c_m = \sum_{k=1}^{m-1} \frac{k}{m}\, c_k\, a_{m-k}, \quad m = P+1, \ldots, Q \qquad (6) $$

where P is the order of LPC filter used and Q is the size of
cepstrum feature vector. Since cepstrum feature vectors
change rapidly, it is not easy to compress a sequence of
frames of cepstrum coefficients. However, there exists a
transformation between LPCs and line spectrum pair (LSP)
frequencies which changes slowly and can be efficiently
encoded by a delta pulse coded modulation (DPCM)
scheme. Since cepstrum coefficients can be derived directly
from LPC coefficients, LPCs are transformed into LSPs by
transform element 54 which are then encoded to traverse the
communication channel 56. At remote word estimation
element 62, the transformed potential features are inverse
transformed by inverse transform element 60 to provide
acoustic features to word decoder 64 which in response
provides an estimated word string.

An exemplary embodiment of the transform element 54 is
illustrated in FIG. 4 as transform subsystem 70. In FIG. 4,
the LPC coefficients from acoustic processor 52 are provided
to LPC to LSP transformation element 72. Within LPC
to LSP element 72 the LSP coefficients can be determined as
follows. For Pth order LPC coefficients, the corresponding
LSP frequencies are obtained as the P roots which reside
between 0 and π of the following equations:

$$ P(w) = \cos 5w + p_1 \cos 4w + \cdots + p_5/2 \qquad (7) $$

$$ Q(w) = \cos 5w + q_1 \cos 4w + \cdots + q_5/2 \qquad (8) $$

where $p_i$ and $q_i$ can be computed recursively as follows:

$$ p_0 = q_0 = 1 \qquad (9) $$

$$ p_i = -a_i - a_{P+1-i} - p_{i-1}, \quad 1 \le i \le P/2 \qquad (10) $$

$$ q_i = -a_i + a_{P+1-i} + q_{i-1}, \quad 1 \le i \le P/2 \qquad (11) $$

The LSP frequencies are provided to DPCM element 74
where they are encoded for transmission over communication
channel 76.

At inverse transformation element 78, the received signal
from the channel is passed through inverse DPCM element
80 and LSP to LPC element 82 to recover the LSP frequencies
of the speech signal. The inverse process of LPC to LSP
element 72 is performed by LSP to LPC element 82 which
converts the LSP frequencies back into LPC coefficients for
use in deriving the cepstrum coefficients. LSP to LPC
element 82 performs the conversion as follows:

$$ P(z) = (1 + z^{-1}) \prod_{i=1}^{P/2} \left(1 - 2\cos(w_{2i-1})\, z^{-1} + z^{-2}\right) \qquad (12) $$

$$ Q(z) = (1 - z^{-1}) \prod_{i=1}^{P/2} \left(1 - 2\cos(w_{2i})\, z^{-1} + z^{-2}\right) \qquad (13) $$

$$ A(z) = 1 - \sum_{i=1}^{P} a_i z^{-i} = \frac{P(z) + Q(z)}{2} \qquad (14) $$

The LPC coefficients are then provided to LPC to cepstrum
element 84 which provides the cepstrum coefficients to word
decoder 64 in accordance with equations 5 and 6.

Since the word decoder relies solely on an acoustic
feature sequence, which can be prone to noise if transmitted
directly through the communication channel, a potential
acoustic feature sequence is derived and transformed in the
subsystem 51 as depicted in FIG. 3 into an alternative
representation that facilitates transmission. The acoustic
feature sequence for use in word decoder is obtained afterwards
through inverse transformation. Hence, in distributed
implementation of VR, the feature sequence sent through
the air (channel) can be different from the one really used in
word decoder. It is anticipated that the output from transform
element 70 can be further encoded by any error protection
schemes that are known in the art.

In FIG. 5, an improved embodiment of the present invention
is illustrated. In wireless communication applications,
users may desire not to occupy the communication channel
for a small number of simple, but commonly used voiced
commands, in part due to expensive channel access. This can
be achieved by further distributing the word decoding function
between handset 100 and base station 110 in the sense
that a voice recognition with a relatively small vocabulary
size is implemented locally at handset while a second voice
recognition system with larger vocabulary size resides
remotely at base station. They both share the same acoustic
processor at handset. The vocabulary table in local word
decoder contains most widely used words or word strings.
The vocabulary table in remote word decoder, on the other
hand, contains regular words or word strings. Based on this
infrastructure, as illustrated in FIG. 5, the average time that
channel is busy may be lessened and the average recognition
accuracy increased.

Moreover, there will exist two groups of voiced commands
available: one, called special voiced command, corresponds
to the commands recognizable by local VR and the
other, called regular voiced command, corresponds to those
not recognized by the local VR. Whenever a special voiced
command is issued, the real acoustic features are extracted
for local word decoder and voice recognition function is
performed locally without accessing communication channel.
When a regular voiced command is issued, transformed
acoustic feature vectors are transmitted through channel and
word decoding is done remotely at base station.

Since the acoustic features need not be transformed, nor
be coded, for any special voiced command and vocabulary
size is small for local VR, the required computation will be
much less than that of the remote one (the computation
associated with the search for correct word string over
possible vocabularies is proportional to vocabulary size).
Additionally, the local voice recognizer may be modeled
with a simplified version of HMM (such as with a lower
number of states, a lower number of mixture components for
state output probabilities, etc.) compared to remote VR,
since the acoustic feature will be fed directly to local VR
without potential corruption in the channel. This will enable
a local, though limited-vocabulary, implementation of VR at
the handset where the computational load is limited. It is
envisioned that the distributed VR structure can also be used
in other target applications different than wireless
communication systems.
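As an illustration of the transform chain of FIG. 4, the following is a minimal sketch, not the implementation of the invention, of the LPC to LSP conversion, its inverse, and the LPC to cepstrum recursion. It assumes an even LPC order P, the convention A(z) = 1 - sum of a_i z^-i, and the mirrored coefficient a_(P+1-i) in the recursions for p_i and q_i; the cosine polynomials shown in the text for order 10 are written here for general even P. The function names, the grid-plus-bisection root search and the grid size are hypothetical choices made for this example, and DPCM coding and the channel are omitted.

```python
import math

def lpc_to_lsp(a, grid=1024):
    """Map LPC coefficients a_1..a_P (P even) to sorted LSP frequencies in (0, pi)."""
    P, M = len(a), len(a) // 2
    p, q = [1.0], [1.0]                        # p_0 = q_0 = 1
    for i in range(1, M + 1):
        a_i, a_mirror = a[i - 1], a[P - i]     # a_i and a_(P+1-i)
        p.append(-a_i - a_mirror - p[i - 1])
        q.append(-a_i + a_mirror + q[i - 1])

    def cosine_poly(c):
        # cos(M w) + c_1 cos((M-1) w) + ... + c_M / 2, general even-P form (assumption)
        return lambda w: sum(c[i] * math.cos((M - i) * w) for i in range(M)) + c[M] / 2.0

    lsp = []
    for f in (cosine_poly(p), cosine_poly(q)):
        lo, f_lo = 0.0, f(0.0)
        for k in range(1, grid + 1):
            hi = math.pi * k / grid
            f_hi = f(hi)
            if f_lo * f_hi < 0.0:              # a sign change brackets one root
                x0, x1, f0 = lo, hi, f_lo
                for _ in range(60):            # plain bisection
                    mid = 0.5 * (x0 + x1)
                    f_mid = f(mid)
                    if f0 * f_mid <= 0.0:
                        x1 = mid
                    else:
                        x0, f0 = mid, f_mid
                lsp.append(0.5 * (x0 + x1))
            lo, f_lo = hi, f_hi
    return sorted(lsp)

def polymul(x, y):
    """Multiply two polynomials given as coefficient lists in powers of z^-1."""
    out = [0.0] * (len(x) + len(y) - 1)
    for i, xi in enumerate(x):
        for j, yj in enumerate(y):
            out[i + j] += xi * yj
    return out

def lsp_to_lpc(lsp):
    """Rebuild a_1..a_P from sorted LSP frequencies via A(z) = (P(z) + Q(z)) / 2."""
    pz, qz = [1.0, 1.0], [1.0, -1.0]           # (1 + z^-1) and (1 - z^-1)
    for k, w in enumerate(lsp):
        factor = [1.0, -2.0 * math.cos(w), 1.0]
        if k % 2 == 0:                         # w_1, w_3, ... belong to P(z)
            pz = polymul(pz, factor)
        else:                                  # w_2, w_4, ... belong to Q(z)
            qz = polymul(qz, factor)
    az = [(pz[i] + qz[i]) / 2.0 for i in range(len(lsp) + 1)]
    return [-c for c in az[1:]]                # A(z) = 1 - sum a_i z^-i

def lpc_to_cepstrum(a, Q):
    """Cepstrum c_1..c_Q from LPC coefficients by the recursion of equations 5 and 6."""
    P = len(a)
    c = [0.0] * (Q + 1)
    for m in range(1, Q + 1):
        acc = a[m - 1] if m <= P else 0.0      # the a_m term exists only for m <= P
        for k in range(1, m):
            if m - k <= P:                     # a_j is taken as zero for j > P
                acc += (k / m) * c[k] * a[m - k - 1]
        c[m] = acc
    return c[1:]
```

With a stable fourth-order example such as a = [0.4, -0.05, 0.17, -0.06], calling lsp_to_lpc(lpc_to_lsp(a)) should reproduce a to within rounding, after which lpc_to_cepstrum can be applied on the receive side. The root search assumes neighbouring LSP frequencies are separated by more than one grid step.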

Referring to FIG. 5, speech signals are provided to acoustic
processor 102, which then extracts features, for example 
LPC based feature parameters, from the speech signal. 
These features are then provided to local word decoder 106 
which searches to recognize the input speech signal from its 
small vocabulary. If it fails to decode the input word string 
and determines that remote VR should decode it, it signals 
transform element 104 which prepares the features for 
transmission. The transformed features are then transmitted 
over communication channel 108 to remote word decoder 
110. The transformed features are received at inverse trans- 
form element 112, which performs the inverse operation of 
transform element 104 and provides the acoustic features to 
remote word decoder element 114 which in response
provides the estimated remote word string.
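The control flow of FIG. 5 can be sketched as follows. This is only an illustration under stated assumptions: nearest-template matching with a Euclidean distance stands in for the word decoders (the text contemplates HMM based decoders), the rejection threshold is a hypothetical tuning parameter, and the transform, channel and inverse transform steps are reduced to a direct function call. All names are invented for the example.

```python
import math

def nearest(features, templates):
    """Return (word, distance) of the template closest to the feature vector."""
    best_word, best_dist = None, math.inf
    for word, reference in templates.items():
        d = math.dist(features, reference)     # Euclidean distance
        if d < best_dist:
            best_word, best_dist = word, d
    return best_word, best_dist

def recognize(features, local_vocab, remote_vocab, reject_threshold=0.5):
    """Try the small local vocabulary first; fall back to the remote decoder."""
    word, dist = nearest(features, local_vocab)
    if dist <= reject_threshold:
        return word, "local"                   # special voiced command, no channel use
    # Regular voiced command: in the described system the features would be
    # transformed, sent over the channel and inverse transformed before decoding.
    word, _ = nearest(features, remote_vocab)
    return word, "remote"

# Toy two-dimensional feature vectors, purely illustrative.
LOCAL_VOCAB = {"call": (1.0, 0.0), "redial": (0.0, 1.0)}
REMOTE_VOCAB = {"weather": (3.0, 3.0), "voicemail": (-2.0, 2.0)}
```

For instance, recognize((0.9, 0.1), LOCAL_VOCAB, REMOTE_VOCAB) is decided locally, while a vector far from every local template, such as (2.8, 3.1), is passed to the remote vocabulary.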

The previous description of the preferred embodiments is 
provided to enable any person skilled in the art to make or 
use the present invention. The various modifications to these 
embodiments will be readily apparent to those skilled in the 
art, and the generic principles defined herein may be applied 
to other embodiments without the use of the inventive 
faculty. Thus, the present invention is not intended to be 
limited to the embodiments shown herein but is to be 
accorded the widest scope consistent with the principles and 
novel features disclosed herein. 

We claim: 

1. A remote station for use in a mobile communications 
system, comprising: 

feature extraction means located at the remote station for 
receiving a frame of speech samples and extracting a 
set of parameters for speech recognition; 

first word decoder means for receiving said set of param- 
eters and for extracting the meaning of said speech 
from said parameters in accordance with a small 
vocabulary; and 

transmission means for wirelessly transmitting a set of 
parameters that cannot be decoded by said first word 
decoder means to a receiving station having second 
word decoder means for extracting the meaning of said 
speech from the transmitted parameters in accordance 
with a larger vocabulary. 

2. The remote station of claim 1, further comprising a 
microphone for receiving an acoustical signal and for pro- 
viding said acoustical signal to said feature extraction 
means. 

3. The remote station of claim 1, further comprising 
transform means interposed between said feature extraction 
means and said transmitter means for receiving said set of 
parameters and for transforming said set of parameters into 
an alternative representation of said set of parameters in 
accordance with a predetermined transformation format. 

4. The remote station of claim 1, wherein said set of 
parameters comprises linear prediction coefficients.

5. The remote station of claim 1, wherein said set of 
parameters comprises line spectral pair values. 

6. The remote station of claim 3, wherein said set of 
parameters comprises linear prediction coefficients and said 
predetermined transformation format is a linear prediction 
coefficient to line spectral pair transformation. 

7. The remote station of claim 1, further comprising 
receiver means for receiving a response signal in accordance 
with a speech recognition operation upon said frame of 
speech. 
8. The remote station of claim 7, further comprising
control means for receiving said response signal and for 
providing a control signal in accordance with said response 
signal. 

9. The remote station of claim 1, further comprising a 
transform element interposed between and having an input 
coupled to an output of said feature extraction means and 
having an output coupled to an input of said transmission 
means. 

10. The remote station of claim 7, further comprising a 
control element having an input coupled to said receiver 
output and having an output for providing a control signal in 
accordance with said response signal. 

11. A central communications station for use in a mobile 
communications system, comprising: 

a word decoder located at said central communications 
station for receiving a set of speech parameters from a 
remote station physically separated from said central 
communications station and communicating therewith 
by wireless communications means, for performing a 
speech recognition operation on said set of speech 
parameters using a regular vocabulary associated with 
said word decoder located at said central communica- 
tions station, wherein said speech parameters are not 
recognizable by a local vocabulary associated with a 
word decoder located in a remote station; and 

a signal generator for generating a response signal based 
on a result of said speech recognition operation. 

12. The central communications station of claim 11, 
further comprising a receiver having an input for receiving 
a remote station speech parameter signal and for providing 
said remote station speech parameters to said word decoder 
means. 

13. The central communications station of claim 11, 
further comprising control means having an input coupled to 
said word decoder output and having an output for providing 
a control signal. 

14. A distributed voice recognition system, comprising:

a local word decoder located at a subscriber station for
receiving extracted acoustic features of a first frame of
speech samples and for decoding said acoustic features
in accordance with a small vocabulary; and

a remote word decoder located at a central processing
station physically separated from said subscriber sta- 
tion for receiving extracted acoustic features of a 
second frame and for decoding, in accordance with a 
regular vocabulary, larger than said small vocabulary, 
said acoustic features of said second frame which 
cannot be decoded by said local word decoder. 

15. The system of claim 14 further comprising a prepro- 
cessor for extracting said acoustic features of said frame of 
speech samples in accordance with a predetermined feature
extraction format and for providing said acoustic features. 

16. The system of claim 15 wherein said acoustic features 
are linear predictive coding (LPC) based parameters. 

17. The system of claim 15 wherein said acoustic features 
are cepstral coefficients. 

18. The system of claim 15 wherein said preprocessor 
comprises a voice coder (vocoder). 

19. The system of claim 18 wherein said vocoder is a code 
excited linear prediction (CELP) vocoder. 

20. The system of claim 18 wherein said vocoder is a 
linear predictive coding (LPC) based vocoder.

21. The system of claim 18 wherein said vocoder is a 
multi-band excitation (MBE) based vocoder. 

22. The system of claim 18 wherein said vocoder is an 
ADPCM vocoder. 

23. The system of claim 14 further comprising:

a transform element located at said subscriber station for 
receiving said acoustic features and for converting said 
acoustic features into transformed features in accor- 
dance with a predetermined transform format, said 
transformed features transmitted through a communi- 
cation channel to said central processing station; and 

an inverse transform element located at said central 
processing station for receiving said transformed features
and for converting said transformed features into
estimated acoustic features in accordance with a pre- 
determined inverse transform format, and for providing 
said estimated acoustic features to said remote word 
decoder. 

24. The system of claim 23 wherein said acoustic features 
are linear predictive coding (LPC) based parameters, said 
predetermined transform format converts said LPC based 
parameters to line spectral pair (LSP) frequencies, and said 
inverse transform format converts said LSP frequencies into
LPC based parameters. 

25. The system of claim 14 wherein said local word
decoder performs acoustic pattern matching based on a 
hidden Markov model (HMM). 

26. The system of claim 14 wherein said remote word 
decoder performs acoustic pattern matching based on a 
hidden Markov model (HMM). 

27. The system of claim 14 wherein said local word 
decoder performs acoustic pattern matching based on 
Dynamic Time Warping (DTW). 

28. The system of claim 14 wherein said remote word 
decoder performs acoustic pattern matching based on 
Dynamic Time Warping (DTW).

29. The system of claim 14 wherein said subscriber 
station communicates with said central processing station by 
wireless communication means. 

30. The system of claim 29 wherein said wireless com- 
munication means comprises a CDMA communication sys- 
tem. 
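
Illustrative note (not part of the claims). The arrangement recited in claims 14, 27 and 28 can be sketched roughly as follows: a local decoder tries to match the extracted features against a small vocabulary using Dynamic Time Warping, and features it cannot decode are passed to a remote decoder with a larger vocabulary. This is only a minimal sketch under stated assumptions. The function names (`dtw_distance`, `best_match`, `recognize`), the template dictionaries, the Euclidean frame distance and the rejection threshold are all hypothetical and are not taken from the patent, and the transmission over the communication channel is reduced to a plain function call.

```python
import math


def dtw_distance(a, b):
    """Dynamic Time Warping cost between two sequences of feature vectors.

    Assumption: the per-frame distance is Euclidean (math.dist).
    """
    n, m = len(a), len(b)
    # cost[i][j] = minimum cumulative cost of aligning a[:i] with b[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(a[i - 1], b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # a advances, b repeats
                                 cost[i][j - 1],      # b advances, a repeats
                                 cost[i - 1][j - 1])  # both advance
    return cost[n][m]


def best_match(features, vocabulary, threshold):
    """Return the closest word template, or None if no template is
    within the (hypothetical) rejection threshold."""
    best_word, best_cost = None, math.inf
    for word, template in vocabulary.items():
        c = dtw_distance(features, template)
        if c < best_cost:
            best_word, best_cost = word, c
    return best_word if best_cost <= threshold else None


def recognize(features, local_vocab, remote_vocab, threshold):
    """Try the small local vocabulary first; fall back to the larger one.

    In a real system the fallback would send the features over a
    communication channel; here it is a direct call for illustration.
    """
    word = best_match(features, local_vocab, threshold)
    if word is not None:
        return word, "local"
    return best_match(features, remote_vocab, threshold), "remote"


if __name__ == "__main__":
    local_vocab = {"call": [(0.0,), (1.0,), (2.0,)]}
    remote_vocab = {"call": [(0.0,), (1.0,), (2.0,)],
                    "weather": [(5.0,), (6.0,), (7.0,)]}
    print(recognize([(0.0,), (1.0,), (1.0,), (2.0,)],
                    local_vocab, remote_vocab, 0.5))
    print(recognize([(5.0,), (6.0,), (7.0,)],
                    local_vocab, remote_vocab, 0.5))
```

The one-dimensional feature vectors in the usage example are toy values chosen for readability. An actual system would use multi-dimensional acoustic features such as the LPC based parameters or cepstral coefficients recited in claims 16 and 17.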